arXiv:1504.00091v2 [stat.ML] 4 Jul 2015


Learning in the Presence of Corruption 


Brendan van Rooyen*,† Robert C. Williamson*

*The Australian National University †National ICT Australia

{brendan.vanrooyen, bob.williamson}@nicta.com.au


Abstract 

In supervised learning one wishes to identify a pattern present in a joint distribution $P$ of instance,
label pairs, by providing a function $f$ from instances to labels that has low risk
$\mathbb{E}_{(x,y)\sim P}\,\ell(y, f(x))$. To do so, the learner is given access to $n$ iid samples drawn
from $P$. In many real world problems clean samples are not available. Rather, the learner is given access
to samples from a corrupted distribution $\tilde{P}$ from which to learn, while the goal of predicting the
clean pattern remains. There are many different types of corruption one can consider, and as yet there is
no general means to compare the relative ease of learning under these different corruption processes. In
this paper we develop a general framework for tackling such problems and introduce upper and lower bounds
on the risk for learning in the presence of corruption. Our ultimate goal is to be able to make informed
economic decisions regarding the acquisition of data sets. For a certain subclass of corruption processes
(those that are reconstructible) we achieve this goal in a particular sense. Our lower bounds are in terms
of the coefficient of ergodicity [19], a simple to calculate property of stochastic matrices. Our upper
bounds proceed via a generalization of the method of unbiased estimators appearing in [30] and implicit in
the earlier work [24].


1 Introduction 

The goal of supervised learning is to find a function in some hypothesis class that predicts a relationship 
between instances and labels. Such a function should have low average loss according to the true distribution 
of instances and labels, P. The learner is not given direct access to P, but rather a training set comprising n 
iid samples from P. There are many algorithms for solving this problem (for example empirical risk
minimization) and this problem is well understood.

There are many other types of data one could learn from. For example in semi-supervised learning [?]
the learner is given $n$ instance label pairs and $m$ instances devoid of labels. In learning with noisy
labels [?], the learner observes instance label pairs where the observed labels have been corrupted by
some noise process. There are many other variants including, but not limited to, learning with label
proportions [?], learning with partial labels [?], multiple instance learning [26] as well as combinations
of the above.

What is currently lacking is a general theory of learning from corrupted data, as well as means to
compare the relative usefulness of different data types. Such a theory is required if one wishes to make
informed economic decisions on which data sets to acquire. For example, are $n$ clean data better or worse
than $n_1$ noisy labels and $n_2$ partial labels?

To answer this question we first place the problem of corrupted learning into the abstract language of
statistical decision theory. We then develop general lower and upper bounds on the risk relative to the
amount of corruption of the clean data. Finally we show examples of problems that fit into this abstract
framework.

The main contributions of this paper are: 

• Novel, general means to construct methods for learning from corrupted data based on a generalization
of the method of unbiased estimators presented in [30] and implicit in the earlier work [24] (theorems
1 and 2).

• Novel lower bounds on the risk of corrupted learning (theorem 5).

• Means to understand compositions of corruptions (lemmas 6 and 10).

• Upper and lower bounds on the risk of learning from combinations of corrupted data (theorems 3 and
6).

• Analyses of the tightness of the above bounds. 

In doing so we provide answers to our central question of how to rank different types of corrupted data, 
through the utilization of our upper or lower bounds. While not the complete story for all problems, the 
contributions outlined above make progress toward the final goal of being able to make informed economic 
decisions regarding the acquisition of data sets. All proofs omitted in the main text appear in the appendix. 


2 The Decision Theoretic Framework 

Decision theory deals with the general problem of decision making under uncertainty. One starts with a set
$\Theta$ of possible true hypotheses (only one of which is actually true) as well as a set $A$ of actions
available to the decision maker. Prior to acting, the decision maker performs an experiment, the outcome of
which is assumed to be related to the true hypothesis, and observes $z$ in an observation space $O$.
Ultimately the decision maker takes an action $a$ and incurs loss $L(\theta, a)$, with $\theta$ the unknown
true hypothesis. We model the relationships between unknowns and the results of experiments with Markov
kernels [34, 25, 29, 13]. The abstract development that follows is necessary in order to place a wide range
of corruption processes into a single framework so that they may be compared.

2.1 Markov Kernels 

As much of our focus will be on noise on the labels and not on the instances, henceforth we will assume we
are only working with finite sets.

Denote by $\mathcal{P}(X)$ the set of probability distributions on a set $X$. Define a Markov kernel from a
set $X$ to a set $Y$ (denoted by $X \rightsquigarrow Y$) to be a function $T : X \to \mathcal{P}(Y)$. Denote
the set of all Markov kernels from $X$ to $Y$ by $M(X, Y)$. Every function $f : X \to Y$ defines a Markov
kernel $T : X \rightsquigarrow Y$ with $T(x) = \delta_{f(x)}$, a point mass on $f(x)$. Given two Markov
kernels $T_1 : X \rightsquigarrow Y$ and $T_2 : Y \rightsquigarrow Z$ we can compose them to form
$T_2 T_1 : X \rightsquigarrow Z$ by taking

$$\mathbb{E}_{z \sim T_2 T_1(x)} f(z) = \mathbb{E}_{y \sim T_1(x)}\, \mathbb{E}_{z \sim T_2(y)} f(z)$$

for all $f : Z \to \mathbb{R}$. One can also combine Markov kernels in parallel. If $P \in \mathcal{P}(X)$
and $Q \in \mathcal{P}(Y)$, denote the product distribution by $P \otimes Q$. If
$T_i : X \rightsquigarrow Y$, $i \in [1; n]$, are Markov kernels then
$\otimes_{i=1}^n T_i : X^n \rightsquigarrow Y^n$ with
$(\otimes_{i=1}^n T_i)(x_1, \dots, x_n) = T_1(x_1) \otimes \cdots \otimes T_n(x_n)$. By restricting ourselves
to finite sets, distributions can be represented by vectors, Markov kernels by column stochastic matrices
(non-negative matrices whose columns sum to 1) and composition by matrix multiplication. An experiment on
$\Theta$ is any Markov kernel with domain $\Theta$ and a learning algorithm $\mathcal{A}$ is any Markov
kernel with co-domain $A$. Finally, from any experiment $e : \Theta \rightsquigarrow O$ we define the
replicated experiment $e_n : \Theta \rightsquigarrow O^n$, $n \in \{1, 2, \dots\}$, with
$e_n(\theta) = e(\theta)^n$ the $n$-fold product of $e(\theta)$.
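The matrix view of Markov kernels on finite sets can be sketched in a few lines of Python. This is only an illustration: the helper names (`compose`, `push_forward`, `is_column_stochastic`) and the two example kernels are assumptions made for this sketch, not objects defined in the paper.

```python
def compose(T2, T1):
    """Matrix of the composed kernel T2 T1 (T1 is applied first)."""
    return [[sum(T2[z][y] * T1[y][x] for y in range(len(T1)))
             for x in range(len(T1[0]))]
            for z in range(len(T2))]


def push_forward(T, p):
    """Distribution T(p) obtained by passing the distribution p through T."""
    return [sum(T[y][x] * p[x] for x in range(len(p))) for y in range(len(T))]


def is_column_stochastic(T, tol=1e-12):
    """True if all entries are non-negative and every column sums to 1."""
    non_negative = all(v >= 0 for row in T for v in row)
    columns_sum_to_one = all(
        abs(sum(row[x] for row in T) - 1.0) < tol for x in range(len(T[0]))
    )
    return non_negative and columns_sum_to_one


# Hypothetical kernels: asymmetric binary label noise (T1) followed by
# erasing the label with probability 0.5 (T2, third row = "missing").
# Entry T[y][x] is the probability of output y given input x.
T1 = [[0.8, 0.1],
      [0.2, 0.9]]
T2 = [[0.5, 0.0],
      [0.0, 0.5],
      [0.5, 0.5]]

T21 = compose(T2, T1)
p = [0.4, 0.6]
direct = push_forward(T21, p)                      # apply the composed kernel
stepwise = push_forward(T2, push_forward(T1, p))   # apply T1, then T2
```

Composing first and then applying the kernel gives the same distribution as applying the two kernels one after the other, which is the matrix form of the composition rule above.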

2.2 Loss and Risk 

One assesses the consequence of actions through a loss $L : \Theta \times A \to \mathbb{R}$. It is sometimes
useful to work with losses in curried form. From any loss $L$ and action $a \in A$, define
$L_a \in \mathbb{R}^\Theta$ with $L_a(\theta) = L(\theta, a)$. We measure the size of a loss function by its
supremum norm $\|L\|_\infty = \sup_{\theta, a} |L(\theta, a)|$. If $P \in \mathcal{P}(\Theta)$ and
$Q \in \mathcal{P}(A)$ we overload our notation with
$L(P, Q) = \mathbb{E}_{\theta \sim P}\, \mathbb{E}_{a \sim Q} L(\theta, a)$.

Normally, we are not interested in the absolute loss of an action, rather its loss relative to the best
action, defined formally as the regret $\Delta L(\theta, a) = L(\theta, a) - \inf_{a'} L(\theta, a')$. We
measure the performance of an algorithm $\mathcal{A}$ by the risk

$$R_L(e, \theta, \mathcal{A}) = \mathbb{E}_{z \sim e(\theta)}\, \mathbb{E}_{a \sim \mathcal{A}(z)}\, \Delta L(\theta, a).$$

For the sake of comparison by a single number either the max risk or the average risk with respect to a
distribution $P_\Theta \in \mathcal{P}(\Theta)$ can be used. We define a learning problem to be a pair
$(L, e)$ with $L : \Theta \times A \to \mathbb{R}$ a loss and $e : \Theta \rightsquigarrow O$ an experiment.
We measure the difficulty of a learning problem by the minimax risk

$$\underline{R}_L(e) = \inf_{\mathcal{A}} \sup_{\theta} R_L(e, \theta, \mathcal{A}).$$

Normally we are not concerned with the quality of a learning algorithm for observation of a single
$z \in O$. Rather we wish to know the rate at which the risk decreases as the number of replications of the
experiment grows. Hence the prime quantity of interest is $\underline{R}_L(e_n)$.


2.3 Statistics vs Machine Learning 

While the ideas of the previous subsections originated in theoretical statistics [?] they can be
readily applied to machine learning problems. The main distinction is that statistics focuses on parametric
families and loss functions of type $L : \Theta \times \Theta \to \mathbb{R}$. The goal is to accurately
reconstruct parameters. In machine learning one is interested in predicting the observations of the
experiment well. There the focus is on problems with $\Theta = \mathcal{P}(O)$ and loss functions of the
form $L(\theta, a) = \mathbb{E}_{z \sim P_\theta}\, \ell(z, a)$, where $\ell : O \times A \to \mathbb{R}$
measures how well $a$ predicts the observation $z$. Our focus is on problems of the second sort, however
abstractly there is no real difference. Both are just different learning problems. When clear we use
$\ell(P, a)$ and $L(P, a)$ interchangeably.

2.3.1 Supervised Learning 

In Table 1 we explain the mapping of supervised learning into our abstract language. We focus on the problem 
of conditional probability estimation of which learning a binary classifier is a special case. Letting X be the 
instance space and Y the label space we have 


Table 1: Supervised Learning

Unknowns $\Theta$:          distributions of instance, label pairs, $\mathcal{P}(X \times Y)$
Observation space $O$:      $n$ instance label pairs, $(X \times Y)^n$
Action space $A$:           function class $\mathcal{F} \subseteq \mathcal{P}(Y)^X$
Experiment $e$:             maps each $P \in \mathcal{P}(X \times Y)$ to itself
Loss $L$:                   $L(\theta, f) = \mathbb{E}_{(x,y) \sim P_\theta}\, \ell(y, f(x))$


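We have

$$R_L(e_n, P, \mathcal{A}) = \mathbb{E}_{S \sim P^n}\, \mathbb{E}_{f \sim \mathcal{A}(S)}\, \mathbb{E}_{(x,y) \sim P}\, \ell(y, f(x)) - \inf_{f' \in \mathcal{F}} \mathbb{E}_{(x,y) \sim P}\, \ell(y, f'(x)),$$

a standard object of study in learning theory [8].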

2.4 Corrupted Learning 

In corrupted learning, rather than observing $z \in O$, one observes a corrupted $\tilde{z}$ in a different
observation space $\tilde{O}$. We model the corruption process through a Markov kernel
$T : O \rightsquigarrow \tilde{O}$ and define a corrupted learning problem to be the triple $(L, e, T)$. For
convenience we define the corrupted experiment $\tilde{e} = Te$. Ideally we wish to compare
$\underline{R}_L(\tilde{e}_n)$ with $\underline{R}_L(e_n)$. By general forms of the information processing
theorem [32, 22], $\underline{R}_L(\tilde{e}_n) \ge \underline{R}_L(e_n)$, however this does not allow one to
rank the utility of different $T$.

Even after many years of directed research, in general we cannot compute $\underline{R}_L(e_n)$ exactly,
let alone $\underline{R}_L(\tilde{e}_n)$ for general corruptions. Consequently our effort for the remainder
turns to upper and lower bounds of $\underline{R}_L(\tilde{e}_n)$.


3 Upper Bounds for Corrupted Learning


When convenient we use the shorthand $T(P) = \tilde{P}$. [30] introduced a method of learning classifiers
from data subjected to label noise, termed the method of unbiased estimators. Here we show that this method
can be generalized to other corruptions. Firstly, $\mathcal{P}(O) \subseteq (\mathbb{R}^O)^*$, the dual space
of $\mathbb{R}^O$. We use the notation $\langle P, f \rangle = \mathbb{E}_{z \sim P} f(z)$. From any Markov
kernel $T : O \rightsquigarrow \tilde{O}$, we obtain a linear map
$T : (\mathbb{R}^O)^* \to (\mathbb{R}^{\tilde{O}})^*$ with

$$\langle T(\alpha), f \rangle = \langle \alpha, T^*(f) \rangle, \quad \forall f \in \mathbb{R}^{\tilde{O}},$$

where $T^*(f)(z) = \mathbb{E}_{\tilde{z} \sim T(z)} f(\tilde{z})$ is the pullback of $f$ by $T$. In terms of
matrices $T^*$ is the transpose or adjoint of $T$.

Definition 1. A Markov kernel $T : O \rightsquigarrow \tilde{O}$ is reconstructible if $T$ has a left
inverse, that is, there exists a linear map $R : (\mathbb{R}^{\tilde{O}})^* \to (\mathbb{R}^O)^*$ such that
$RT = I$.


Intuitively, T is reconstructible if there is some transformation that “undoes” the effects of T. In general 
R is not a Markov kernel. Many forms of corrupted learning are reconstructible, including semi-supervised 
learning, learning with label noise and learning with partial labels for all but a few pathological cases. The 
reader is directed to section 10.1 for worked examples.


We call a left inverse of $T$ a reconstruction. For concreteness, one can always take

$$R = (T^* T)^{-1} T^*,$$

the Moore-Penrose pseudo inverse of $T$. Reconstructible Markov kernels are exactly those where we can
transfer a loss function from the clean distribution to the corrupted distribution. We have by properties of
adjoints

$$\langle P, f \rangle = \langle RT(P), f \rangle = \langle T(P), R^*(f) \rangle.$$

In words, to take expectations of $f$ with samples from $\tilde{P}$ we use the corruption corrected
$\tilde{f} = R^*(f)$.

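Theorem 1 (Corruption Corrected Loss). For all reconstructible $T : O \rightsquigarrow \tilde{O}$, loss
functions $\ell : O \times A \to \mathbb{R}$ and reconstructions $R$ define the corruption corrected loss
$\tilde{\ell} : \tilde{O} \times A \to \mathbb{R}$, with $\tilde{\ell}_a = R^* \ell_a$. Then for all
distributions $P \in \mathcal{P}(O)$, $\ell(P, a) = \tilde{\ell}(\tilde{P}, a)$.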
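The identity in theorem 1 can be checked numerically for binary label noise. The following is a minimal sketch under stated assumptions: the flip probabilities, the clean label distribution and all function names are hypothetical, and the reconstruction is taken to be the ordinary inverse of the 2 x 2 noise matrix.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("kernel is not reconstructible")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def corrected_loss(T, loss_a):
    """Corruption corrected loss vector R* loss_a for a 2 x 2 kernel T."""
    R = inverse_2x2(T)            # reconstruction: RT = I
    return matvec(transpose(R), loss_a)


# Hypothetical flip probabilities: label -1 flips w.p. 0.2, label +1 w.p. 0.1.
# Columns are indexed by the clean label (-1, +1), rows by the observed label.
T = [[0.8, 0.1],
     [0.2, 0.9]]
P = [0.4, 0.6]          # clean label distribution (assumed for illustration)
loss_a = [0.0, 1.0]     # 01 loss of the action a = -1 against labels (-1, +1)

P_tilde = matvec(T, P)                  # corrupted label distribution
loss_tilde = corrected_loss(T, loss_a)  # corrected loss on observed labels

clean_risk = sum(p * l for p, l in zip(P, loss_a))
corrupted_risk = sum(p * l for p, l in zip(P_tilde, loss_tilde))
```

The two risks agree up to floating point error even though one entry of `loss_tilde` is negative, which illustrates that a corrected loss need not stay non-negative.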


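We direct the reader to section 10.1 for some examples of $\tilde{\ell}$ for different corruptions.
Minimizing $\tilde{\ell}$ on a sample $\tilde{S} \sim \tilde{P}^n$ provides means to learn from corrupted
data. Let $\tilde{\ell}(\tilde{S}, a) = \frac{1}{|\tilde{S}|} \sum_{\tilde{z} \in \tilde{S}} \tilde{\ell}(\tilde{z}, a)$,
the average loss on the sample.

By an application of the PAC Bayes bound [28, 39, 10] one has for all algorithms
$\mathcal{A} : \tilde{O}^n \rightsquigarrow A$, priors $\pi \in \mathcal{P}(A)$ and distributions
$\tilde{P} \in \mathcal{P}(\tilde{O})$

$$\mathbb{E}_{\tilde{S} \sim \tilde{P}^n}\, \tilde{\ell}(\tilde{P}, \mathcal{A}(\tilde{S})) \le \mathbb{E}_{\tilde{S} \sim \tilde{P}^n}\, \tilde{\ell}(\tilde{S}, \mathcal{A}(\tilde{S})) + \|\tilde{\ell}\|_\infty \sqrt{\frac{2\, \mathbb{E}_{\tilde{S} \sim \tilde{P}^n} D_{KL}(\mathcal{A}(\tilde{S}), \pi)}{n}}.$$

This bound yields the following theorem.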

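Theorem 2. For all reconstructible Markov kernels $T : O \rightsquigarrow \tilde{O}$, algorithms
$\mathcal{A} : \tilde{O}^n \rightsquigarrow A$, priors $\pi \in \mathcal{P}(A)$, distributions
$P \in \mathcal{P}(O)$ and bounded loss functions $\ell$

$$\mathbb{E}_{\tilde{S} \sim \tilde{P}^n}\, \ell(P, \mathcal{A}(\tilde{S})) \le \mathbb{E}_{\tilde{S} \sim \tilde{P}^n}\, \tilde{\ell}(\tilde{S}, \mathcal{A}(\tilde{S})) + \|\tilde{\ell}\|_\infty \sqrt{\frac{2\, \mathbb{E}_{\tilde{S} \sim \tilde{P}^n} D_{KL}(\mathcal{A}(\tilde{S}), \pi)}{n}}.$$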

A similar result also holds with high probability on draws from $\tilde{P}^n$. If $\mathcal{A}$ is Empirical
Risk Minimization (ERM), $A$ is finite and $\pi$ is uniform on $A$, the above analysis yields convergence to
the optimum $a \in A$ at a rate proportional to $\|\tilde{\ell}\|_\infty / \sqrt{n}$ for learning with
corrupted data versus $\|\ell\|_\infty / \sqrt{n}$ for learning with clean data. Therefore, the ratio
$\|\tilde{\ell}\|_\infty / \|\ell\|_\infty$ measures the relative difficulty of corrupted versus clean
learning.

3.1 Upper Bounds for Combinations of Corrupted Data

Recall that our final goal is to be able to make informed economic decisions regarding the acquisition of
data sets. As such, we wish to quantify the utility of a data set comprising different corrupted data. For
example in learning with noisy labels out of $n$ data, there could be $n_1$ clean, $n_2$ slightly noisy and
$n_3$ very noisy samples and so on. More generally we assume access to a corrupted sample $\tilde{S}$, made
up of $k$ different types of corrupted data, with $\tilde{S}_i \sim \tilde{P}_i^{n_i}$.

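Theorem 3. Let $T_i : O \rightsquigarrow \tilde{O}_i$ be a collection of $k$ reconstructible Markov kernels.
Let $Q = \otimes_{i=1}^k \tilde{P}_i^{n_i}$ and $\tilde{O} = \times_{i=1}^k \tilde{O}_i^{n_i}$,
$n = \sum_{i=1}^k n_i$ and $r_i = \frac{n_i}{n}$. Then for all algorithms
$\mathcal{A} : \tilde{O} \rightsquigarrow A$, priors $\pi \in \mathcal{P}(A)$, distributions
$P \in \mathcal{P}(O)$ and bounded loss functions $\ell$

$$\mathbb{E}_{\tilde{S} \sim Q}\, \ell(P, \mathcal{A}(\tilde{S})) \le \mathbb{E}_{\tilde{S} \sim Q} \sum_{i=1}^k r_i\, \tilde{\ell}_i(\tilde{S}_i, \mathcal{A}(\tilde{S})) + \sqrt{K} \sqrt{\frac{2\, \mathbb{E}_{\tilde{S} \sim Q} D_{KL}(\mathcal{A}(\tilde{S}), \pi)}{n}}$$

where $K = \sum_{i=1}^k r_i \|\tilde{\ell}_i\|_\infty^2$.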

A similar result also holds with high probability on draws from $Q$. Theorem 3 is a generalization of the
final bound appearing in [17] that only pertains to symmetric label noise and binary classification. Theorem
3 suggests the following means of choosing data sets. Let $c_i$ be the cost of acquiring data corrupted by
$T_i$ and $C$ the maximum total cost. First, choose data from the $T_i$ with lowest
$c_i \|\tilde{\ell}_i\|_\infty^2$ until picking more violates the budget constraint. Then choose data from
the second lowest and so on.


4 Lower Bounds for Corrupted Learning 

Thus far we have developed upper bounds for ERM style algorithms. In particular we have found that
reconstructible corruption does not affect the rate at which learning occurs, it only affects constants in
the upper bound. Can we do better? Are these constants tight? To answer these questions we develop lower
bounds for corrupted learning.

Here we review Le Cam's method [25], a powerful technique for generating lower bounds for learning problems
that very often gives the correct rate and dependence on constants (including being able to reproduce the
standard VC dimension lower bounds for classification presented in [27]). In recent times it has been used
to establish lower bounds for: differentially private learning [20], learning in a distributed set up [?],
function evaluations required in convex optimization [?] as well as generic lower bounds in statistical
estimation problems [37]. We show how this method can be extended using the strong data processing theorem
[9, 15] to provide a general tool for lower bounding corrupted learning problems.


4.1 Le Cam’s Method and Minimax Lower Bounds 

Le Cam's method proceeds by reducing a general learning problem to an easier binary classification problem,
before relating the best possible performance on this classification problem to the minimax risk. Define the
separation $\rho : \Theta \times \Theta \to \mathbb{R}$,
$\rho(\theta_1, \theta_2) = \inf_a \Delta L(\theta_1, a) + \Delta L(\theta_2, a)$. The separation measures
how hard it is to act well against both $\theta_1$ and $\theta_2$ simultaneously. We have the following (see
section 10.4 for a more detailed treatment).

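Lemma 1. For all experiments $e$, loss functions $L$ and $\theta_1, \theta_2 \in \Theta$

$$\underline{R}_L(e) \ge \rho(\theta_1, \theta_2) \left( \frac{1}{4} - \frac{1}{8} V(e(\theta_1), e(\theta_2)) \right)$$

where $V$ is the variational divergence.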

This lower bound is a trade off between distances measured by $\rho$ and statistical distances measured by
the variational divergence. A learning problem is easy if proximity in variational divergence of
$e(\theta_1)$ and $e(\theta_2)$ (hard to distinguish $\theta_1$ and $\theta_2$ statistically) implies
proximity of $\theta_1$ and $\theta_2$ in $\rho$ (hard to distinguish $\theta_1$ and $\theta_2$ with
actions).

If there exist $\theta_1, \theta_2$ with $e(\theta_1) = e(\theta_2)$ and $\rho(\theta_1, \theta_2) > 0$ we
instantly get that the minimax regret must be positive. For corrupted experiments, if $T$ is not
reconstructible it may be the case that $Te(\theta_1) = Te(\theta_2)$ for some $\theta_1, \theta_2$. Hence we
assume that $T$ is reconstructible.

4.1.1 Replication and Rates 

We wish to lower bound how the risk decreases as $n$ grows. When working with replicated experiments it
can be advantageous to work with an $f$-divergence (see section 4.3) different to variational divergence and
to invoke a generalized Pinsker inequality [32]. Common choices in theoretical statistics are the Hellinger
and alpha divergences [23] as well as the KL divergence [20]. Here we use the variational divergence and the
following lemma.

Lemma 2. For all collections of distributions $P_i, Q_i \in \mathcal{P}(O_i)$, $i \in [1; k]$

$$V(\otimes_{i=1}^k P_i, \otimes_{i=1}^k Q_i) \le \sum_{i=1}^k V(P_i, Q_i).$$

Here we make use of the specific case where $P_i = P$ and $Q_i = Q$ for all $i$.

Lemma 3. For all experiments $e$, loss functions $L$, $\theta_1, \theta_2 \in \Theta$ and $n$

$$\underline{R}_L(e_n) \ge \rho(\theta_1, \theta_2) \left( \frac{1}{4} - \frac{n}{8} V(e(\theta_1), e(\theta_2)) \right).$$

To use lemma 3 one defines $\theta_1 = \phi_1(n)$ and $\theta_2 = \phi_2(n)$ for $n \in [0, \infty)$, with the
property

$$\frac{1}{4} - \frac{n}{8} V(e(\theta_1), e(\theta_2)) \ge \frac{1}{8}$$

or equivalently $V(e(\theta_1), e(\theta_2)) \le \frac{1}{n}$. This yields a lower bound of

$$\underline{R}_L(e_n) \ge \frac{1}{8} \rho(\phi_1(n), \phi_2(n)).$$

To obtain tight lower bounds, $\phi$ needs to be designed in a problem dependent fashion. However, as our
goal here is to reason relatively we assume that $\phi$ is given.

4.1.2 Other Methods for Obtaining Minimax Lower Bounds 

There are many other techniques for lower bounds in terms of functions of pairwise KL divergences [38]
(for example Assouad's method) as well as functions of pairwise $f$-divergences [23]. While such methods are
often required to get tighter lower bounds, all of what follows can be applied to these more intricate lower 
bounding techniques. Therefore, for the sake of conceptual clarity, we proceed with Le Cam’s method. 

4.2 Measuring the Amount of Corruption 

Rather than the experiment $e$, in corrupted learning we work with the corrupted experiment $\tilde{e}$.
The information processing theorem for $f$-divergences [32] states that

$$V(T(P), T(Q)) \le V(P, Q), \quad \forall P, Q.$$

Thus any lower bound achieved by Le Cam's method for $e$ can be directly transferred to one for
$\tilde{e}$. This is just a manifestation of theorems presented in [32, 22] and alluded to in section 2.4.
However, this provides us with no means to rank different $T$. For some $T$, the information processing
theorem can be strengthened, in the sense that one can find $\alpha(T) < 1$ such that

$$\forall P, Q, \quad V(T(P), T(Q)) \le \alpha(T) V(P, Q).$$

The coefficient $\alpha(T)$ provides a means to measure the amount of corruption present in $T$. For example
if $T$ is constant and maps all $P$ to the same distribution, then $\alpha(T) = 0$. If $T$ is an invertible
function, then $\alpha(T) = 1$. Together with lemma 3 this strong information processing theorem [?] leads
to meaningful lower bounds that allow the comparison of different corrupted experiments.


4.3 A Generic Strong Data Processing Theorem. 

Following [?], we present a strong data processing theorem that works for all $f$-divergences.

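Definition 2. Let $X$ be a set and $f : \mathbb{R}_+ \to \mathbb{R}$ a convex function with $f(1) = 0$. For
all distributions $P, Q \in \mathcal{P}(X)$ the $f$-divergence between $P$ and $Q$ is

$$D_f(P, Q) = \sum_{x \in X} Q(x)\, f\!\left( \frac{P(x)}{Q(x)} \right).$$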

Both the variational and KL divergence are examples of $f$-divergences. For fixed $T$ we seek an
$\alpha(T)$ such that

$$D_f(T(P), T(Q)) \le \alpha(T) D_f(P, Q), \quad \forall P, Q, f.$$

To do so we first relate the amount $T$ contracts $P$ and $Q$ to a certain decomposition for Markov kernels
before proving when such a decomposition can occur.

Lemma 4. For all Markov kernels $T : X \rightsquigarrow Y$ and distributions $P, Q \in \mathcal{P}(X)$, if
there exist $F, G \in M(X, Y)$ and $\lambda \in [0, 1]$ such that $T = \lambda F + (1 - \lambda) G$ with
$F(P) = F(Q)$ then $D_f(T(P), T(Q)) \le (1 - \lambda) D_f(P, Q)$.

Hence the amount $T$ contracts $P$ and $Q$ is related to the amount of $T$ that fixes $P$ and $Q$. We seek
the largest $\lambda$ such that a decomposition $T = \lambda F + (1 - \lambda) G$ is always possible, no
matter what pair of distributions $F$ is required to fix.

Lemma 5. For all Markov kernels $T : X \rightsquigarrow Y$ define
$\lambda(T) = \min_{i,j} \sum_k \min(T_{ki}, T_{kj})$. Then $\lambda \le \lambda(T)$ if and only if for all
pairs of distributions $P, Q$ there exists a decomposition

$$T = \lambda F + (1 - \lambda) G$$

with $F, G \in M(X, Y)$ and $F(P) = F(Q)$.

Theorem 4 (Strong Data Processing). For all Markov kernels $T : X \rightsquigarrow Y$ define
$\alpha(T) = 1 - \lambda(T)$. Then for all $P, Q, f$,

$$D_f(T(P), T(Q)) \le \alpha(T) D_f(P, Q).$$

The proof is a simple application of lemma 4 and lemma 5. It is easy to see that $0 \le \alpha(T) \le 1$.
Furthermore $\alpha(T) = 0$ if and only if all of the columns of $T$ are the same. While this $\alpha$ may
not be the tightest for a given $f$, it is generic and as such can be applied in all lower bounding methods
mentioned previously.
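Since $\lambda(T)$ only involves pairs of columns, $\alpha(T)$ is cheap to compute. The sketch below follows the definitions in lemma 5 and theorem 4; the function names and the three example matrices are assumptions chosen for illustration.

```python
def lam(T):
    """lambda(T): minimum over column pairs (i, j) of sum_k min(T[k][i], T[k][j])."""
    cols = range(len(T[0]))
    return min(sum(min(row[i], row[j]) for row in T) for i in cols for j in cols)


def alpha(T):
    """alpha(T) = 1 - lambda(T), as in theorem 4."""
    return 1.0 - lam(T)


# Kernels are column stochastic: T[k][i] is the probability of output k given input i.
constant = [[0.3, 0.3],
            [0.7, 0.7]]      # every input is mapped to the same distribution
identity = [[1.0, 0.0],
            [0.0, 1.0]]      # no corruption
label_noise = [[0.8, 0.1],
               [0.2, 0.9]]   # hypothetical flip probabilities 0.2 and 0.1

alphas = {
    "constant": alpha(constant),
    "identity": alpha(identity),
    "label_noise": alpha(label_noise),
}
```

In this example the constant kernel gives a coefficient of 0, the identity gives 1, and the label noise kernel gives roughly 0.7, that is, one minus the sum of the two flip probabilities.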


4.4 Relating a to Variational Divergence 


It can be shown [15] that
$\alpha(T) = \max_{x_1, x_2} V(T(x_1), T(x_2)) = \frac{1}{2} \max_{i,j} \sum_k |T_{ki} - T_{kj}|$, the
maximum $L_1$ distance between the columns of $T$ [32]. Furthermore

$$\alpha(T) = \sup_{P, Q \in \mathcal{P}(X)} \frac{V(T(P), T(Q))}{V(P, Q)} = \sup_{v \in S} \frac{\|Tv\|_1}{\|v\|_1}$$

where $S = \{v : \sum_i v_i = 0, v \ne 0\}$. Hence $\alpha(T)$ is the operator 1-norm of $T$ when restricted
to $S$. The above also shows that $\alpha(T)$ provides the tightest strong data processing theorem possible
when using variational divergence, and hence it gives the tightest generic strong data processing theorem.
We also have the following compositional property of $\alpha$.

Lemma 6. For all Markov kernels $T_1 : X \rightsquigarrow Y$ and $T_2 : Y \rightsquigarrow Z$,

$$\alpha(T_2 T_1) \le \alpha(T_2) \alpha(T_1) \le \min(\alpha(T_2), \alpha(T_1)).$$

Hence $T_2 T_1$ is at least as corrupt as either of the $T_i$.
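The compositional property can be checked on a small example. The sketch below computes the coefficient through the column distance form, half the largest $L_1$ distance between two columns, and composes a hypothetical label noise kernel with a hypothetical label erasure kernel; all names and numbers are assumptions made for illustration.

```python
def alpha_l1(T):
    """Half the maximum L1 distance between any two columns of T."""
    cols = range(len(T[0]))
    return 0.5 * max(sum(abs(row[i] - row[j]) for row in T)
                     for i in cols for j in cols)


def compose(T2, T1):
    """Matrix of the composed kernel T2 T1 (T1 is applied first)."""
    return [[sum(T2[z][y] * T1[y][x] for y in range(len(T1)))
             for x in range(len(T1[0]))]
            for z in range(len(T2))]


# Hypothetical corruptions: label noise, then erasing the label w.p. 0.5.
T1 = [[0.8, 0.1],
      [0.2, 0.9]]
T2 = [[0.5, 0.0],
      [0.0, 0.5],
      [0.5, 0.5]]

a1 = alpha_l1(T1)
a2 = alpha_l1(T2)
a21 = alpha_l1(compose(T2, T1))
```

For this particular pair the first inequality of lemma 6 holds with equality up to rounding, so the composed corruption is exactly as contractive as the product of the two coefficients allows.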

The first use of $\alpha(T)$ occurs in the work of [19] where it is called the coefficient of ergodicity and
is used (much like in [?]) to prove rates of convergence of Markov chains to their stationary distribution.


4.5 Lower bounds Relative to the Amount of Corruption 


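Lemma 7. For all experiments $e$, loss functions $L$, $\theta_1, \theta_2 \in \Theta$, $n$ and corruptions
$T : O \rightsquigarrow \tilde{O}$

$$\underline{R}_L(\tilde{e}_n) \ge \rho(\theta_1, \theta_2) \left( \frac{1}{4} - \frac{\alpha(T)\, n}{8} V(e(\theta_1), e(\theta_2)) \right).$$

The proof is a simple application of lemma 3 and the strong data processing theorem. Suppose we have
proceeded as in section 4.1.1, defining $\theta_1 = \phi_1(n)$ and $\theta_2 = \phi_2(n)$ with
$V(e(\theta_1), e(\theta_2)) \le \frac{1}{n}$. Letting $\theta_1 = \phi_1(\alpha(T) n)$ and
$\theta_2 = \phi_2(\alpha(T) n)$ gives $V(e(\theta_1), e(\theta_2)) \le \frac{1}{\alpha(T) n}$. Furthermore

$$\underline{R}_L(\tilde{e}_n) \ge \frac{1}{8} \rho(\phi_1(\alpha(T) n), \phi_2(\alpha(T) n)).$$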


In words, if ever Le Cam’s method gives a lower bound of f(n) for repetitions of the clean experiment, we 
obtain a lower bound of f(a(T)n) for repetitions of the corrupted experiment. Hence the rate is unaffected, 
only the constants. However, a penalty of factor a(T) is unavoidable no matter what learning algorithm is 
used, suggesting that a(T) is a valid way of measuring the amount of corruption. We summarize the results 
of this section in the following theorem. 

Theorem 5. For all corruptions $T : O \rightsquigarrow \tilde{O}$ and experiments
$e : \Theta \rightsquigarrow O$, if Le Cam's method yields a lower bound $\underline{R}_L(e_n) \ge f(n)$ then
$\underline{R}_L(\tilde{e}_n) \ge f(\alpha(T) n)$.

In particular if one has a lower bound of $\frac{C}{\sqrt{n}}$ for the clean problem, as is usual for many
machine learning problems, theorem 5 yields a lower bound of $\frac{C}{\sqrt{\alpha(T) n}}$ for the corrupted
problem.
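One way to read theorem 5 is as an exchange rate between clean and corrupted samples. The sketch below assumes a clean lower bound of the form $C / \sqrt{n}$; the constant `C`, the value of the coefficient and the function names are hypothetical.

```python
import math


def clean_lower_bound(n, C=1.0):
    """Assumed form of the clean lower bound, f(n) = C / sqrt(n)."""
    return C / math.sqrt(n)


def corrupted_lower_bound(n, alpha_T, C=1.0):
    """Theorem 5: replace n by alpha(T) * n in the clean bound."""
    return clean_lower_bound(alpha_T * n, C)


def corrupted_samples_to_match(n_clean, alpha_T):
    """Number of corrupted samples whose lower bound is no larger than the
    lower bound for n_clean clean samples."""
    return math.ceil(n_clean / alpha_T)


alpha_T = 0.7                     # e.g. label noise with flip probabilities 0.2 and 0.1
n_clean = 1000
n_corrupted = corrupted_samples_to_match(n_clean, alpha_T)
```

Under these assumptions about 1429 corrupted samples are needed before the corrupted lower bound drops to the level of 1000 clean samples. This compares lower bounds only and says nothing about whether an algorithm attains them.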


4.6 Lower Bounds for Combinations of Corrupted Data 


As in section 3.1 we present lower bounds for combinations of corrupted data. For example in learning with
noisy labels out of $n$ data, there could be $n_1$ clean, $n_2$ slightly noisy and $n_3$ very noisy samples
and so on.


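Theorem 6. Let $T_i : O \rightsquigarrow \tilde{O}_i$, $i \in [1; k]$, be reconstructible Markov kernels. Let
$T = \otimes_{i=1}^k T_i^{n_i}$ with $n = \sum_{i=1}^k n_i$. If Le Cam's method yields a lower bound
$\underline{R}_L(e_n) \ge f(n)$ then

$$\underline{R}_L(T e_n) \ge f(K) \quad \text{where } K = \sum_{i=1}^k \alpha(T_i)\, n_i.$$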


As in section 3.1 this bound suggests means of choosing data sets, via the following integer program

$$\max_{n_1, \dots, n_k} \sum_{i=1}^k \alpha(T_i)\, n_i \quad \text{subject to} \quad \sum_{i=1}^k c_i n_i \le C$$

where $c_i$ is the cost of acquiring data corrupted by $T_i$ and $C$ is the maximum total cost. This is
exactly the unbounded knapsack problem [18] which admits the following near optimal greedy algorithm. First,
choose data from the $T_i$ with highest $\frac{\alpha(T_i)}{c_i}$ until picking more violates the
constraints. Then pick from the second highest and so on.
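The greedy rule for this integer program can be sketched as follows. It is a heuristic, not an exact knapsack solver, and the coefficients, costs and budget in the example are made up for illustration.

```python
def greedy_acquisition(alphas, costs, budget):
    """Greedy heuristic for: maximise sum_i alphas[i] * n_i
    subject to sum_i costs[i] * n_i <= budget, with non-negative integers n_i.

    Sources are visited in decreasing order of alpha / cost, and as many
    samples as the remaining budget allows are bought from each in turn.
    """
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive")
    order = sorted(range(len(alphas)),
                   key=lambda i: alphas[i] / costs[i],
                   reverse=True)
    counts = [0] * len(alphas)
    remaining = budget
    for i in order:
        counts[i] = int(remaining // costs[i])
        remaining -= counts[i] * costs[i]
    return counts


# Hypothetical data sources: clean, mildly corrupted, heavily corrupted.
alphas = [1.0, 0.7, 0.3]
costs = [5, 3, 2]
budget = 101

counts = greedy_acquisition(alphas, costs, budget)
K = sum(a * n for a, n in zip(alphas, counts))          # effective sample size
spent = sum(c * n for c, n in zip(costs, counts))
```

Here the mildly corrupted source has the best ratio of coefficient to cost, so the budget is spent there first and the small remainder goes to the cheapest source that still fits.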

5 Measuring the Tightness of the Upper Bounds and Lower Bounds


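In the previous sections we have shown upper bounds that depend on $\|\tilde{\ell}\|_\infty$ as well as lower
bounds that depend on $\alpha(T)$. Recall from theorem 1 that $\tilde{\ell}_a = R^* \ell_a$, as such the
worst case ratio $\frac{\|\tilde{\ell}\|_\infty}{\|\ell\|_\infty}$ is determined by the operator norm of
$R^*$. For a linear map $R : \mathbb{R}^X \to \mathbb{R}^Y$ define

$$\|R\|_1 = \sup_{v \in \mathbb{R}^X} \frac{\|Rv\|_1}{\|v\|_1}, \qquad \|R\|_\infty = \sup_{v \in \mathbb{R}^X} \frac{\|Rv\|_\infty}{\|v\|_\infty},$$

which are two operator norms of $R$. They are equal to the maximum absolute column and row sum of $R$
respectively [6]. Hence $\|R\|_1 = \|R^*\|_\infty$.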
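Both operator norms reduce to simple sums over matrix entries, which the following sketch makes concrete. The matrix `R` is an arbitrary example, not a reconstruction taken from the paper.

```python
def norm_1(R):
    """Operator 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in R) for j in range(len(R[0])))


def norm_inf(R):
    """Operator infinity-norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in R)


def transpose(R):
    return [list(col) for col in zip(*R)]


# Arbitrary example matrix with entries that are exact in floating point.
R = [[1.5, -0.5],
     [-0.25, 2.0]]

one_norm = norm_1(R)                      # columns sum to 1.75 and 2.5
inf_norm = norm_inf(R)                    # rows sum to 2.0 and 2.25
inf_norm_of_adjoint = norm_inf(transpose(R))
```

Taking the transpose swaps rows and columns, which is why the 1-norm of `R` coincides with the infinity-norm of its adjoint.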

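Lemma 8. For all losses $\ell$, $T : O \rightsquigarrow \tilde{O}$ and reconstructions $R$,
$\|\tilde{\ell}\|_\infty \le \|R^*\|_\infty \|\ell\|_\infty$.

Lemma 9. If $T : X \rightsquigarrow Y$ is reconstructible, with reconstruction $R$, then

$$\frac{1}{\alpha(T)} \le \frac{1}{\inf_{u \in \mathbb{R}^X} \frac{\|Tu\|_1}{\|u\|_1}} \le \|R^*\|_\infty.$$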

The intuition here is if $T$ contracts a particular $v \in \mathbb{R}^X$ greatly, which would occur if

$$\inf_{P, Q \in \mathcal{P}(X)} \frac{\|T(P) - T(Q)\|_1}{\|P - Q\|_1}$$

was small (here $v = P - Q$), then $R^*$ could greatly increase the norm of a loss $\ell$. However, it need
not increase the norm of the particular loss of interest. Note that for lower bounds we look at the best case
separation of columns of $T$, for upper bounds we essentially use the worst. We also get the following
compositional theorem.

Lemma 10. If $T_1 : X \rightsquigarrow Y$ and $T_2 : Y \rightsquigarrow Z$ are reconstructible, with
reconstructions $R_1$ and $R_2$, then $T_2 T_1$ is reconstructible with reconstruction $R_1 R_2$. Furthermore
$\frac{1}{\alpha(T_2) \alpha(T_1)} \le \|R_1 R_2\|_1 \le \|R_1\|_1 \|R_2\|_1$.

Proof. The first statement is obvious. For the first inequality simply use lemma 9 followed by lemma 6. The
second inequality is an easy to prove property of operator norms. □


5.1 Comparing Theorems 2 and 5

What we have shown is the following implication, for all reconstructible $T$

$$\frac{C_1}{\sqrt{n}} \le \underline{R}_L(e_n) \le \frac{C_2 \|\ell\|_\infty}{\sqrt{n}} \implies \frac{C_1}{\sqrt{\alpha(T) n}} \le \underline{R}_L(\tilde{e}_n) \le \frac{C_2 \|\tilde{\ell}\|_\infty}{\sqrt{n}}.$$

By lemma 9, in the worst case $\|\tilde{\ell}\|_\infty \ge \frac{\|\ell\|_\infty}{\alpha(T)}$, and in the
"optimistic worst case" we arrive at bounds a factor of $\alpha(T)$ apart. We do not know if this is the
fault of our upper or lower bounding techniques. However, when considering specific $\ell$ and $T$ this gap
is no longer present (see section 10.1).


5.2 Comparing Theorems 3 and 6

Assuming $c_i$ is the cost of acquiring data corrupted by $T_i$, theorem 6 ranks the utility of different
corruptions by $\frac{\alpha(T_i)}{c_i}$ whereas theorem 3 ranks by
$\frac{1}{c_i \|\tilde{\ell}_i\|_\infty^2}$. By lemma 9, $\frac{1}{\|R_i^*\|_\infty}$ is a proxy for
$\alpha(T_i)$, meaning both theorems are "doing the same thing". In theorems 6 and 3 we have a best case and
a worst case loss specific method for choosing data sets. Theorem 3 combined with lemma 8 provides a worst
case loss insensitive method for choosing data sets.

6 What if Clean Learning is Fast?


The preceding largely solves the problem of learning from corrupted data when learning from the clean
distribution occurs at a slow ($\frac{1}{\sqrt{n}}$) rate. The reader is directed to section 10.12 for some
preliminary work on when corrupted learning also occurs at a fast rate.


7 Proper Losses and Convexity 

Definition 3. A loss $\ell : O \times \mathcal{P}(O) \to \mathbb{R}$ is proper if

$$P \in \arg\min_{Q \in \mathcal{P}(O)} \mathbb{E}_{z \sim P}\, \ell(z, Q).$$

It is strictly proper if $P$ is the unique minimizer.

Proper losses provide suitable surrogate losses for learning problems. All strictly proper losses can be
convexified through the use of the canonical link function [33, 36]. Ultimately one works with a loss of the
form

$$\ell(z, v) = v(z) + \Psi(v) \mathbf{1}(z)$$

with $v \in \mathbb{R}^O$, $\mathbf{1} \in \mathbb{R}^O$ the constant function $\mathbf{1}(z) = 1$ and
$\Psi : \mathbb{R}^O \to \mathbb{R}$ a convex function.

Theorem 7 (Preservation of Convexity). Let $v \in \mathbb{R}^O$ and $\Psi : \mathbb{R}^O \to \mathbb{R}$ be a
convex function. Define the loss $\ell(z, v) = v(z) + \Psi(v)$. Then

$$\tilde{\ell}(\tilde{z}, v) = R^* v(\tilde{z}) + \Psi(v).$$

Furthermore this loss is convex in $v$.

This was first noticed in [14].

8 Uses in Supervised Learning 

Recall in supervised learning $O = X \times Y$ and the goal is to find a function that predicts $Y$ from $X$
with low expected loss. Many supervised learning techniques proceed by minimizing a proper loss. Given a
suitable function class $\mathcal{F} \subseteq \mathcal{P}(Y)^X$ and a strictly proper loss $\ell$ they
attempt to find

$$f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim P}\, \ell(y, f(x)).$$

Using the canonical link function and a carefully chosen function class leaves the learner with a convex
problem. If we assume the labels have been corrupted by a corruption $T : Y \rightsquigarrow \tilde{Y}$, we
can correct for the corruptions and solve for

$$\arg\min_{f \in \mathcal{F}} \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}}\, \tilde{\ell}(\tilde{y}, f(x)).$$

This objective is equivalent to the first and will also be convex.


9 Conclusion 

We have sought to solve the problem of how to rank different forms of corrupted data with the ultimate
goal of making informed decisions regarding the acquisition of data sets. To do so we have introduced a
general framework in which many corrupted learning tasks can be expressed. Furthermore, we have derived
general upper and lower bounds for the reconstructible subset of corrupted learning problems. Finally, we
have shown that in some examples these bounds are tight enough to be of use and that they produce the
quantities one would expect. These bounds facilitate the ranking of different corrupted data, either through
the use of best case lower bounds or worst case upper bounds. We have shown both loss specific bounds and
bounds that are worst case over the choice of loss. Future work will attempt to further refine these methods
as well as extend the framework to non-reconstructible problems such as multiple instance learning and
learning with label proportions. Theorems 3 and 6 provide means of choosing between data sets that feature
collections of different corrupted data.

10 Appendix

10.1 Examples 

We now show examples of common corrupted learning problems. Once again, our focus is corruption of the
labels and not the instances. Thus we work directly with losses $\ell : Y \times A \to \mathbb{R}$. In
particular we work with classification problems. We present the worst case upper bound, $\|R^*\|_\infty$, as
well as the upper bound relevant for 01 loss, $\|\tilde{\ell}_{01}\|_\infty$.


10.1.1 Noisy Labels 

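We consider the problem of learning from noisy binary labels [?, 30]. Here $\sigma_i$ is the probability
that class $i$ is flipped. We have

$$T = \begin{pmatrix} 1 - \sigma_{-1} & \sigma_1 \\ \sigma_{-1} & 1 - \sigma_1 \end{pmatrix}, \qquad R^* = \frac{1}{1 - \sigma_{-1} - \sigma_1} \begin{pmatrix} 1 - \sigma_1 & -\sigma_{-1} \\ -\sigma_1 & 1 - \sigma_{-1} \end{pmatrix}.$$

This yields

$$\tilde{\ell}(y, a) = \frac{(1 - \sigma_{-y})\, \ell(y, a) - \sigma_y\, \ell(-y, a)}{1 - \sigma_{-1} - \sigma_1}.$$

The above equation is lemma 1 in [30] and is the original method of unbiased estimators. Interestingly, even
if $\ell$ is positive, $\tilde{\ell}$ can be negative. If the noise is symmetric with
$\sigma_{-1} = \sigma_1 = \sigma$ and $\ell$ is 01 loss then

$$\tilde{\ell}(y, a) = \frac{\ell_{01}(y, a) - \sigma}{1 - 2\sigma}$$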

which is just a rescaled and shifted version of 01 loss. If we work in the realizable setting, i.e. there is some
$f \in \mathcal{F}$ with

$$\mathbb{E}_{(x,y)\sim P}\,\ell_{01}(y, f(x)) = 0,$$

then the above provides an interesting correspondence between learning with symmetric label noise and
learning under distributions with large Tsybakov margin [?]. Taking $\sigma = 1 - h$ with $P$ separable in turn
implies $\tilde{P}$ has Tsybakov margin $h$. This means bounds developed for this setting [27] can be transferred to
the setting of learning with symmetric label noise. Our lower bound reproduces the results of [27].
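
As a concrete illustration of the method of unbiased estimators for binary label noise, the following Python sketch implements the corrupted loss of lemma 1 in [30] and checks numerically that averaging it over the noisy label recovers the clean loss. The function names and the noise rates 0.1 and 0.3 are illustrative assumptions, not values from the paper.

```python
def zero_one(y, a):
    """01 loss for labels and actions in {-1, +1}."""
    return 0.0 if y == a else 1.0


def corrupted_loss(loss, y_noisy, a, sigma):
    """Corrupted loss evaluated at a noisy label.

    sigma maps each class in {-1, +1} to its flip probability.
    """
    denom = 1.0 - sigma[-1] - sigma[1]
    if denom == 0.0:
        raise ValueError("sigma[-1] + sigma[1] = 1: T is not reconstructible")
    num = ((1.0 - sigma[-y_noisy]) * loss(y_noisy, a)
           - sigma[y_noisy] * loss(-y_noisy, a))
    return num / denom


def expected_corrupted_loss(loss, y, a, sigma):
    """Average of the corrupted loss over the noisy label, given clean label y."""
    keep, flip = 1.0 - sigma[y], sigma[y]
    return (keep * corrupted_loss(loss, y, a, sigma)
            + flip * corrupted_loss(loss, -y, a, sigma))


sigma = {-1: 0.1, 1: 0.3}  # illustrative noise rates (assumption)
for y in (-1, 1):
    for a in (-1, 1):
        # Unbiasedness: the corrupted loss averages back to the clean loss.
        assert abs(expected_corrupted_loss(zero_one, y, a, sigma) - zero_one(y, a)) < 1e-12

# The corrupted loss can be negative even though 01 loss never is.
print(corrupted_loss(zero_one, 1, 1, sigma))
```

The check passes for any noise rates with $\sigma_{-1} + \sigma_1 \neq 1$; when the sum equals one the denominator vanishes and the sketch raises an error.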


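Below is a table of the relevant parameters for learning with noisy binary labels. These results directly
extend those present in [24] that considered only the case of symmetric label noise.


Learning with Label Noise (Binary)

$T$                              $\begin{pmatrix} 1-\sigma_{-1} & \sigma_1 \\ \sigma_{-1} & 1-\sigma_1 \end{pmatrix}$

$R^*$                            $\frac{1}{1-\sigma_{-1}-\sigma_1}\begin{pmatrix} 1-\sigma_1 & -\sigma_{-1} \\ -\sigma_1 & 1-\sigma_{-1} \end{pmatrix}$

$\alpha(T)$                      $|1-\sigma_{-1}-\sigma_1|$

$\|R^*\|_\infty$                 $\frac{\max(1-\sigma_{-1}+\sigma_1,\ 1-\sigma_1+\sigma_{-1})}{|1-\sigma_{-1}-\sigma_1|}$

$\|\tilde{\ell}_{01}\|_\infty$   $\frac{\max(1-\sigma_{-1},\ 1-\sigma_1,\ \sigma_{-1},\ \sigma_1)}{|1-\sigma_{-1}-\sigma_1|}$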


We see that as long as $\sigma_{-1} + \sigma_1 \neq 1$, $T$ is reconstructible. The pattern we see in this table is quite common:
$\|R^*\|_\infty$ tends to be marginally greater than $\frac{1}{\alpha(T)}$, with $\|\tilde{\ell}_{01}\|_\infty$ less than both. In the symmetric case our
lower bound reproduces those of [3].
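
The three quantities compared above can be computed directly from the matrices rather than from closed forms. The Python sketch below does this for binary label noise; the function name and the example noise rates are illustrative assumptions.

```python
def binary_noise_parameters(s_neg, s_pos):
    """Return (alpha(T), ||R*||_inf, ||corrupted 01 loss||_inf) for binary label noise.

    s_neg and s_pos are the flip probabilities of classes -1 and +1.
    """
    det = 1.0 - s_neg - s_pos
    if det == 0.0:
        raise ValueError("T is not reconstructible when s_neg + s_pos = 1")
    # Columns of T are the noisy-label distributions for clean labels -1 and +1.
    T = [[1.0 - s_neg, s_pos],
         [s_neg, 1.0 - s_pos]]
    # R* is the transpose of the inverse of T.
    R_star = [[(1.0 - s_pos) / det, -s_neg / det],
              [-s_pos / det, (1.0 - s_neg) / det]]
    # Coefficient of ergodicity: half the L1 distance between the two columns.
    alpha = 0.5 * sum(abs(row[0] - row[1]) for row in T)
    # Operator norm on l_infinity: the largest absolute row sum.
    r_norm = max(sum(abs(x) for x in row) for row in R_star)
    # 01 loss as a matrix indexed by (label, action), both ordered (-1, +1).
    L01 = [[0.0, 1.0], [1.0, 0.0]]
    corrupted = [[sum(R_star[i][k] * L01[k][j] for k in range(2)) for j in range(2)]
                 for i in range(2)]
    l_norm = max(abs(x) for row in corrupted for x in row)
    return alpha, r_norm, l_norm


alpha, r_norm, l_norm = binary_noise_parameters(0.1, 0.3)
# The ordering described in the text: corrupted 01 loss <= 1/alpha <= ||R*||_inf.
assert l_norm <= 1.0 / alpha <= r_norm
```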

10.1.2 Semi-Supervised Learning 

We consider the problem of semi-supervised learning [12]. Here $1 - \sigma_i$ is the probability class $i$ has a missing
label. We first consider the easier symmetric case where $\sigma_{-1} = \sigma_1 = \sigma$.
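Symmetric Semi-Supervised Learning

$T$                              $\begin{pmatrix} \sigma & 0 \\ 0 & \sigma \\ 1-\sigma & 1-\sigma \end{pmatrix}$

$R^*$                            $\begin{pmatrix} \frac{1-2\sigma+2\sigma^2}{1-3\sigma+5\sigma^2-3\sigma^3} & \frac{-\sigma^2}{1-3\sigma+5\sigma^2-3\sigma^3} \\ \frac{-\sigma^2}{1-3\sigma+5\sigma^2-3\sigma^3} & \frac{1-2\sigma+2\sigma^2}{1-3\sigma+5\sigma^2-3\sigma^3} \\ \frac{\sigma}{1-2\sigma+3\sigma^2} & \frac{\sigma}{1-2\sigma+3\sigma^2} \end{pmatrix}$

$\alpha(T)$                      $\sigma$

$\|R^*\|_\infty$                 $\frac{1}{\sigma}$

$\|\tilde{\ell}_{01}\|_\infty$   $\frac{1-2\sigma+2\sigma^2}{2\sigma-5\sigma^2+3\sigma^3}$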


Once again $\|\tilde{\ell}_{01}\|_\infty < \frac{1}{\alpha(T)}$, as long as $\sigma \neq 0$. Our lower bound confirms that in general unlabelled data
does not help [?]. Rather than using the method of unbiased estimators, one could simply throw away the
unlabelled data, leaving behind $\sigma n$ labelled instances on average.


Semi-Supervised Learning

$T$                              $\begin{pmatrix} \sigma_{-1} & 0 \\ 0 & \sigma_1 \\ 1-\sigma_{-1} & 1-\sigma_1 \end{pmatrix}$

$\alpha(T)$                      $\max_i \sigma_i$


Other parameters for the more general case are omitted due to complexity (they involve the maximum of 
three 4th order rational equations). They are available in closed form. 
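
The value $\alpha(T) = \max_i \sigma_i$ and the symmetric-case norm $\|R^*\|_\infty = \frac{1}{\sigma}$ can be checked numerically. The Python sketch below builds $T$ with rows ordered (observe $-1$, observe $+1$, label missing) and, as an assumption, takes the reconstruction $R$ to be the Moore-Penrose pseudo-inverse of $T$; other left inverses are possible, and all function names are illustrative.

```python
def ssl_kernel(s_neg, s_pos):
    """Column-stochastic T for semi-supervised learning.

    Rows: observe -1, observe +1, label missing. Columns: clean label -1, +1.
    s_neg and s_pos are the probabilities that the label is observed.
    """
    return [[s_neg, 0.0],
            [0.0, s_pos],
            [1.0 - s_neg, 1.0 - s_pos]]


def alpha(T):
    """Coefficient of ergodicity: half the largest L1 distance between two columns."""
    cols = list(zip(*T))
    return max(0.5 * sum(abs(p - q) for p, q in zip(c1, c2))
               for c1 in cols for c2 in cols)


def pinv_tall(T):
    """Pseudo-inverse (T^t T)^{-1} T^t of a full column rank n x 2 matrix."""
    g00 = sum(r[0] * r[0] for r in T)
    g01 = sum(r[0] * r[1] for r in T)
    g11 = sum(r[1] * r[1] for r in T)
    det = g00 * g11 - g01 * g01
    return [[(g11 * r[0] - g01 * r[1]) / det for r in T],
            [(g00 * r[1] - g01 * r[0]) / det for r in T]]


def r_star_norm(R):
    """||R*||_inf, which equals the largest absolute column sum of R."""
    return max(sum(abs(row[j]) for row in R) for j in range(len(R[0])))


assert abs(alpha(ssl_kernel(0.3, 0.8)) - 0.8) < 1e-12
R = pinv_tall(ssl_kernel(0.5, 0.5))
assert abs(r_star_norm(R) - 1.0 / 0.5) < 1e-9
```

Under this choice of $R$, the norm $\frac{1}{\sigma}$ matches what one would expect from simply discarding the unlabelled data.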


10.1.3 Three Class Symmetric Label Noise 

In line with [?], here we present parameters for the three class variant of symmetric label noise. We have
$\tilde{Y} = Y = \{1,2,3\}$ with $P(\tilde{Y} = \tilde{y} \mid Y = y) = 1 - \sigma$ if $\tilde{y} = y$ and $\frac{\sigma}{2}$ otherwise.


Learning with Symmetric Label Noise (Multiclass)



We see that as long as $\sigma \neq \frac{2}{3}$, $T$ is reconstructible. Once again $\|\tilde{\ell}_{01}\|_\infty < \frac{1}{\alpha(T)}$.
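
The reconstructibility condition for three class symmetric noise can be checked by computing the determinant of $T$. The Python sketch below assumes the off-diagonal entries of $T$ are $\frac{\sigma}{2}$, so that each column sums to one; the function names are illustrative.

```python
def three_class_kernel(sigma):
    """T for symmetric three class label noise: keep the label with
    probability 1 - sigma, otherwise move to one of the other two classes."""
    return [[1.0 - sigma if i == j else sigma / 2.0 for j in range(3)]
            for i in range(3)]


def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def alpha(T):
    """Coefficient of ergodicity: half the largest L1 distance between two columns."""
    cols = list(zip(*T))
    return max(0.5 * sum(abs(p - q) for p, q in zip(c1, c2))
               for c1 in cols for c2 in cols)


# T is singular, hence not reconstructible, exactly when sigma = 2/3.
assert abs(det3(three_class_kernel(2.0 / 3.0))) < 1e-12
assert abs(det3(three_class_kernel(0.2))) > 0.1
# For this T the coefficient of ergodicity is |1 - 3 sigma / 2|.
assert abs(alpha(three_class_kernel(0.2)) - 0.7) < 1e-12
```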


10.1.4 Partial Labels 

Here we follow [16] with $Y = \{1,2,3\}$ and $\tilde{Y} = \{0,1\}^3$ the set of partial labels. A partial label of $(0,1,1)$
indicates that the true label is either 2 or 3 but not 1. We assume that a partial label always includes the true
label as one of the possibilities and furthermore that spurious labels are added with probability $\sigma$.

Learning with Partial Labels



We see that as long as $\sigma \neq 1$, $T$ is reconstructible. In this case $\|\tilde{\ell}_{01}\|_\infty$ and $\|R^*\|_\infty$ are given by more
complicated expressions (however they are both available in closed form). We display their interrelation in the
graph below. To the best of our knowledge, no upper or lower bounds are present in the literature
for this problem.
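
The partial label corruption can be written out explicitly as an $8 \times 3$ column stochastic matrix. The Python sketch below assumes that each of the two spurious labels is added independently with probability $\sigma$, which is one reading of the description above; the function names are illustrative.

```python
from itertools import product


def partial_label_kernel(sigma):
    """T for partial labels over three classes.

    Rows are indexed by partial labels b in {0,1}^3, columns by the true label.
    Assumption: each spurious label is added independently with probability sigma.
    """
    T = []
    for b in product((0, 1), repeat=3):
        row = []
        for y in range(3):
            if b[y] == 0:
                row.append(0.0)  # the true label is always included
            else:
                extra = sum(b) - 1  # number of spurious labels present
                row.append(sigma ** extra * (1.0 - sigma) ** (2 - extra))
        T.append(row)
    return T


def alpha(T):
    """Coefficient of ergodicity: half the largest L1 distance between two columns."""
    cols = list(zip(*T))
    return max(0.5 * sum(abs(p - q) for p, q in zip(c1, c2))
               for c1 in cols for c2 in cols)


T = partial_label_kernel(0.25)
# Every column is a probability distribution over the eight partial labels.
assert all(abs(sum(row[y] for row in T) - 1.0) < 1e-12 for y in range(3))
# At sigma = 1 every partial label is (1,1,1), so all information is lost.
assert alpha(partial_label_kernel(1.0)) == 0.0
```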



10.2 PAC Bayesian Bounds 

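PAC Bayesian bounds provide methods to assess the quality of any algorithm $\mathcal{A} : O \rightsquigarrow A$. All of the bounds
presented in this section appear in [39]. We use the shorthand $\ell(S,a) = \frac{1}{|S|}\sum_{z \in S}\ell(z,a)$.

Theorem 8 (PAC Bayes). For all sets $O$, $P \in \mathcal{P}(O)$, priors $\pi \in \mathcal{P}(A)$, algorithms $\mathcal{A} : O \rightsquigarrow A$, functions
$L : O \times A \to \mathbb{R}$ and $\beta > 0$

$$\mathbb{E}_{z\sim P}\,\mathbb{E}_{a\sim\mathcal{A}(z)}\left[-\frac{1}{\beta}\log\left(\mathbb{E}_{z'\sim P}\,e^{-\beta L(z',a)}\right)\right] \le \mathbb{E}_{z\sim P}\left[L(z,\mathcal{A}(z)) + \frac{D_{KL}(\mathcal{A}(z),\pi)}{\beta}\right].$$

Furthermore with probability at least $1-\delta$ on a draw $z \sim P$ with $\pi$, $\beta$ and $\mathcal{A}$ fixed before the draw,

$$\mathbb{E}_{a\sim\mathcal{A}(z)}\left[-\frac{1}{\beta}\log\left(\mathbb{E}_{z'\sim P}\,e^{-\beta L(z',a)}\right)\right] \le L(z,\mathcal{A}(z)) + \frac{D_{KL}(\mathcal{A}(z),\pi) + \log\left(\frac{1}{\delta}\right)}{\beta}.$$

Combined with standard bounds on the cumulant generating function, theorem 8 leads to useful generalization
bounds.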
Lemma 11. Let $\phi : O \to [-a,a]$, then for all $\beta > 0$ and all $P$

$$\mathbb{E}_P\,\phi - \frac{\beta a^2}{2} \le -\frac{1}{\beta}\log\left(\mathbb{E}_P\,e^{-\beta\phi}\right).$$

Proof. See appendix A.1 of [?].


□
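
A bound of this type can be sanity checked numerically. The Python sketch below assumes the correction term is the Hoeffding term $\frac{\beta a^2}{2}$ and verifies, for a small discrete distribution chosen purely for illustration, that $\mathbb{E}_P\phi - \frac{\beta a^2}{2}$ never exceeds $-\frac{1}{\beta}\log \mathbb{E}_P e^{-\beta\phi}$.

```python
import math


def hoeffding_gap(values, probs, a, beta):
    """Right hand side minus left hand side of
    E[phi] - beta * a^2 / 2 <= -(1 / beta) * log E[exp(-beta * phi)].

    values and probs describe a discrete distribution of phi on [-a, a].
    """
    if not all(-a <= v <= a for v in values):
        raise ValueError("phi must take values in [-a, a]")
    mean = sum(p * v for p, v in zip(probs, values))
    # Cumulant generating function of -phi evaluated at beta.
    cgf = math.log(sum(p * math.exp(-beta * v) for p, v in zip(probs, values)))
    return -cgf / beta - (mean - beta * a * a / 2.0)


values = [-1.0, 0.5, 1.0]  # illustrative values of phi (assumption)
probs = [0.2, 0.5, 0.3]
for beta in (0.01, 0.1, 1.0, 5.0):
    assert hoeffding_gap(values, probs, 1.0, beta) >= 0.0
```

The gap shrinks as $\beta \to 0$, which is why the bound is useful after optimizing over $\beta$.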


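10.3 Proof of Theorem 3

Proof. Define $L(\tilde{S},a) = \sum_{i=1}^k \sum_{z\in\tilde{S}_i}\tilde{\ell}_i(z,a)$, the sum of the corrupted losses on the sample. We have by
theorem 8

$$\mathbb{E}_{\tilde{S}\sim Q}\,\mathbb{E}_{a\sim\mathcal{A}(\tilde{S})}\left[-\frac{1}{\beta}\log\left(\mathbb{E}_{\tilde{S}'\sim Q}\,e^{-\beta L(\tilde{S}',a)}\right)\right] \le \mathbb{E}_{\tilde{S}\sim Q}\left[L(\tilde{S},\mathcal{A}(\tilde{S})) + \frac{D_{KL}(\mathcal{A}(\tilde{S}),\pi)}{\beta}\right]$$

$$\mathbb{E}_{\tilde{S}\sim Q}\,\mathbb{E}_{a\sim\mathcal{A}(\tilde{S})}\left[\sum_{i=1}^k -\frac{n_i}{\beta}\log\left(\mathbb{E}_{z\sim\tilde{P}_i}\,e^{-\beta\tilde{\ell}_i(z,a)}\right)\right] \le \mathbb{E}_{\tilde{S}\sim Q}\left[L(\tilde{S},\mathcal{A}(\tilde{S})) + \frac{D_{KL}(\mathcal{A}(\tilde{S}),\pi)}{\beta}\right]$$

where the first line follows from theorem 8 and the second from properties of the cumulant generating
function. Invoking lemma 11 yields

$$\sum_{i=1}^k n_i\left(\mathbb{E}_{\tilde{S}\sim Q}\,\tilde{\ell}_i(\tilde{P}_i,\mathcal{A}(\tilde{S})) - \frac{\|\tilde{\ell}_i\|_\infty^2\,\beta}{2}\right) \le \mathbb{E}_{\tilde{S}\sim Q}\left[L(\tilde{S},\mathcal{A}(\tilde{S})) + \frac{D_{KL}(\mathcal{A}(\tilde{S}),\pi)}{\beta}\right].$$

As the $T_i$ are reconstructible, $\tilde{\ell}_i(\tilde{P}_i,a) = \ell(P,a)$ for every $i$, so with $n = \sum_{i=1}^k n_i$

$$\mathbb{E}_{\tilde{S}\sim Q}\,\ell(P,\mathcal{A}(\tilde{S})) \le \frac{1}{n}\,\mathbb{E}_{\tilde{S}\sim Q}\left[L(\tilde{S},\mathcal{A}(\tilde{S})) + \frac{D_{KL}(\mathcal{A}(\tilde{S}),\pi)}{\beta}\right] + \frac{\sum_{i=1}^k n_i\|\tilde{\ell}_i\|_\infty^2}{2n}\,\beta.$$

Optimizing over $\beta$ yields the desired result.


□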


10.4 Le Cam’s Method and Minimax Lower Bounds 

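The development here closely follows [?] with some streamlining. We consider a general learning problem
with unknowns $\Theta$, observation space $O$ and loss $L : \Theta \times A \to \mathbb{R}$. For any learning algorithm $\mathcal{A} : O \rightsquigarrow A$,
we wish to lower bound the max risk

$$\sup_{\theta}\,\mathbb{E}_{z\sim e(\theta)}\,L(\theta,\mathcal{A}(z)).$$

The method proceeds by reducing a general decision problem to an easier binary classification problem. First
one considers a supremum over a restricted set $\{\theta_1,\theta_2\}$. Using Markov's inequality we then relate this to
the minimum 01 loss in a particular binary classification problem. Finally one finds a lower bound for this
quantity. With $\theta \sim \{\theta_1,\theta_2\}$ meaning $\theta$ is drawn uniformly at random from the set $\{\theta_1,\theta_2\}$, we have

$$\begin{aligned}
\sup_{\theta}\,\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{a\sim\mathcal{A}(z)}\,\Delta L(\theta,a) &\ge \sup_{\theta\in\{\theta_1,\theta_2\}}\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{a\sim\mathcal{A}(z)}\,\Delta L(\theta,a) \\
&\ge \mathbb{E}_{\theta\sim\{\theta_1,\theta_2\}}\,\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{a\sim\mathcal{A}(z)}\,\Delta L(\theta,a) \\
&\ge \delta\,\mathbb{E}_{\theta\sim\{\theta_1,\theta_2\}}\,\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{a\sim\mathcal{A}(z)}\,[\![\Delta L(\theta,a) \ge \delta]\!]. \qquad (1)
\end{aligned}$$

Recall the separation $\rho : \Theta \times \Theta \to \mathbb{R}$, $\rho(\theta_1,\theta_2) = \inf_a \Delta L(\theta_1,a) + \Delta L(\theta_2,a)$. The separation measures
how hard it is to act well against both $\theta_1$ and $\theta_2$ simultaneously. We now assume $\rho(\theta_1,\theta_2) \ge 2\delta$. Define
$f : A \to \{\theta_1,\theta_2,\text{error}\}$ where $f(a) = \theta_i$ if $\Delta L(\theta_i,a) < \delta$ and error otherwise. This function is well defined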


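as if there exists an action $a$ with $\Delta L(\theta_1,a) < \delta$ and $\Delta L(\theta_2,a) < \delta$ then $\rho(\theta_1,\theta_2) < 2\delta$, a contradiction. Let
$\hat{\mathcal{A}}$ be the classifier that first draws $a \sim \mathcal{A}(z)$ and then outputs $f(a)$. We have

$$\begin{aligned}
\sup_{\theta}\,\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{a\sim\mathcal{A}(z)}\,\Delta L(\theta,a) &\ge \delta\,\mathbb{E}_{\theta\sim\{\theta_1,\theta_2\}}\,\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{\theta'\sim\hat{\mathcal{A}}(z)}\,[\![\theta' \neq \theta]\!] \\
&\ge \delta \inf_{\hat{\mathcal{A}} : O \rightsquigarrow \Theta}\mathbb{E}_{\theta\sim\{\theta_1,\theta_2\}}\,\mathbb{E}_{z\sim e(\theta)}\,\mathbb{E}_{\theta'\sim\hat{\mathcal{A}}(z)}\,[\![\theta' \neq \theta]\!]
\end{aligned}$$

where the first line is a rewriting of (1) in terms of the classifier $\hat{\mathcal{A}}$, the second takes an infimum over all
classifiers and the final line is a standard result in theoretical statistics [32]. Taking $\delta = \frac{\rho(\theta_1,\theta_2)}{2}$ yields the lemma.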
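
The standard result invoked in the last step relates the smallest achievable error of a binary test to the distance between the two observation distributions. The Python sketch below illustrates it for finite observation spaces, under the assumption that the variational divergence is normalized as $\frac{1}{2}\|P - Q\|_1$ (the paper's $V$ may be scaled differently); the function names and example distributions are illustrative.

```python
def total_variation(p, q):
    """Half the L1 distance between two distributions on a finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def min_testing_error(p, q):
    """Smallest probability of misidentifying theta when theta is uniform on
    two hypotheses and the observation is drawn from p or q respectively."""
    # For each observation the best classifier outputs the more likely
    # hypothesis, so it errs with the smaller of the two joint probabilities.
    return sum(min(0.5 * pi, 0.5 * qi) for pi, qi in zip(p, q))


p = [0.7, 0.2, 0.1]  # observation distribution under theta_1 (illustrative)
q = [0.1, 0.3, 0.6]  # observation distribution under theta_2 (illustrative)
# Under this normalization the minimal error is (1 - TV) / 2.
assert abs(min_testing_error(p, q) - 0.5 * (1.0 - total_variation(p, q))) < 1e-12
```

When the two distributions coincide the error is one half, and when they have disjoint supports it is zero.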

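10.5 Proof of Lemma 2

Proof. Firstly $V$ is a metric on $\mathcal{P}(\times_{i=1}^k O_i)$ [32]. Thus

$$\begin{aligned}
V(\otimes_{i=1}^k P_i, \otimes_{i=1}^k Q_i) &= V(P_1 \otimes (\otimes_{i=2}^k P_i),\ Q_1 \otimes (\otimes_{i=2}^k Q_i)) \\
&\le V(P_1 \otimes (\otimes_{i=2}^k P_i),\ Q_1 \otimes (\otimes_{i=2}^k P_i)) + V(Q_1 \otimes (\otimes_{i=2}^k P_i),\ Q_1 \otimes (\otimes_{i=2}^k Q_i)) \\
&= V(P_1, Q_1) + V(\otimes_{i=2}^k P_i, \otimes_{i=2}^k Q_i)
\end{aligned}$$

where the first line is by definition, the second as $V$ is a metric and the third is easily verified from the
definition of $V$. To complete the proof proceed inductively. □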

10.6 Proof of Lemma 4

Proof.

$$\begin{aligned}
D_f(T(P), T(Q)) &= D_f(\lambda F(P) + (1-\lambda)G(P),\ \lambda F(Q) + (1-\lambda)G(Q)) \\
&\le \lambda D_f(F(P), F(Q)) + (1-\lambda) D_f(G(P), G(Q)) \\
&= (1-\lambda) D_f(G(P), G(Q)) \\
&\le (1-\lambda) D_f(P, Q)
\end{aligned}$$

where the first line follows from the definition, the second from the joint convexity of $f$-divergences [32], the
third because $F(P) = F(Q)$ and $D_f(P,P) = 0$ and finally the fourth is from the standard data processing
inequality [32].

□


10.7 Proof of Lemma 5

The proof of the forward implication is lemma 2 of [?]. We prove the reverse implication.

Proof. As this decomposition works for all pairs of distributions we can take $P = \delta_{x_i} = e_i$ and $Q = \delta_{x_j} = e_j$.
As $F(P) = F(Q)$ we must have $F_{ki} = F_{kj} = v_k$ for all $k$. As all of the entries of $(1-\lambda)G$ are non-negative,
we have $\lambda v_k \le T_{ki}$ and $\lambda v_k \le T_{kj}$. Hence $\lambda v_k \le \min(T_{ki}, T_{kj})$. Summing over $k$ and remembering that $F$
is column stochastic gives $\lambda \le \sum_k \min(T_{ki}, T_{kj})$. As $i$ and $j$ are arbitrary we have the desired result. □
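
The bound on $\lambda$ is easy to evaluate for a concrete matrix. The Python sketch below computes $\min_{i,j}\sum_k \min(T_{ki}, T_{kj})$ and compares it with the coefficient of ergodicity, using the elementary identity $\sum_k \min(p_k, q_k) = 1 - \frac{1}{2}\|p - q\|_1$ for probability vectors; the function names and the example matrix are illustrative assumptions.

```python
def overlap(T, i, j):
    """Sum over k of min(T[k][i], T[k][j]): the mass shared by columns i and j."""
    return sum(min(row[i], row[j]) for row in T)


def lambda_bound(T):
    """Upper bound on lambda: the smallest overlap over all pairs of columns."""
    n = len(T[0])
    return min(overlap(T, i, j) for i in range(n) for j in range(n))


def alpha(T):
    """Coefficient of ergodicity: half the largest L1 distance between two columns."""
    cols = list(zip(*T))
    return max(0.5 * sum(abs(p - q) for p, q in zip(c1, c2))
               for c1 in cols for c2 in cols)


# Binary label noise with flip probabilities 0.1 and 0.3 (illustrative values).
T = [[0.9, 0.3],
     [0.1, 0.7]]
assert abs(lambda_bound(T) - 0.4) < 1e-12
# The largest admissible lambda and alpha(T) sum to one.
assert abs(alpha(T) - (1.0 - lambda_bound(T))) < 1e-12
```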

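10.8 Proof of Theorem 6

Proof. Let

$$T = \otimes_{i=1}^k T_i^{n_i} = \underbrace{T_1 \otimes \cdots \otimes T_1}_{n_1 \text{ times}} \otimes \underbrace{T_2 \otimes \cdots \otimes T_2}_{n_2 \text{ times}} \otimes \cdots \otimes \underbrace{T_k \otimes \cdots \otimes T_k}_{n_k \text{ times}}.$$

One has $T(e_n(\theta)) = T_1(e(\theta))^{n_1} \otimes T_2(e(\theta))^{n_2} \otimes \cdots \otimes T_k(e(\theta))^{n_k}$. By lemma 2

$$\begin{aligned}
V(T(e_n(\theta_1)), T(e_n(\theta_2))) &\le \sum_{i=1}^k n_i\, V(T_i(e(\theta_1)), T_i(e(\theta_2))) \\
&\le \left(\sum_{i=1}^k \alpha(T_i)\, n_i\right) V(e(\theta_1), e(\theta_2)).
\end{aligned}$$

Now proceed as in the proof of theorem 5.


□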


10.9 Proof of Lemma 6


Proof.

$$\begin{aligned}
\alpha(T_2 T_1) &= \sup_{P,Q\in\mathcal{P}(X)} \frac{\|T_2T_1(P) - T_2T_1(Q)\|_1}{\|P-Q\|_1} \\
&= \sup_{P,Q\in\mathcal{P}(X)} \frac{\|T_2T_1(P) - T_2T_1(Q)\|_1}{\|T_1(P) - T_1(Q)\|_1}\,\frac{\|T_1(P) - T_1(Q)\|_1}{\|P-Q\|_1} \\
&\le \sup_{P,Q\in\mathcal{P}(X)} \frac{\|T_2T_1(P) - T_2T_1(Q)\|_1}{\|T_1(P) - T_1(Q)\|_1}\ \sup_{P,Q\in\mathcal{P}(X)} \frac{\|T_1(P) - T_1(Q)\|_1}{\|P-Q\|_1} \\
&\le \sup_{P,Q\in\mathcal{P}(Y)} \frac{\|T_2(P) - T_2(Q)\|_1}{\|P-Q\|_1}\ \sup_{P,Q\in\mathcal{P}(X)} \frac{\|T_1(P) - T_1(Q)\|_1}{\|P-Q\|_1} \\
&= \alpha(T_2)\,\alpha(T_1)
\end{aligned}$$

where the first line follows from the definitions, the second follows if $T_1(P) \neq T_1(Q)$ and the rest are simple
rearrangements. For the final inequality, remember that $\alpha(T) \le 1$. □
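
The submultiplicativity of the coefficient of ergodicity under composition can be checked on small matrices. In the Python sketch below both matrices are column stochastic and chosen purely for illustration, and $\alpha$ is computed as half the largest $L_1$ distance between two columns.

```python
def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def alpha(T):
    """Coefficient of ergodicity: half the largest L1 distance between two columns."""
    cols = list(zip(*T))
    return max(0.5 * sum(abs(p - q) for p, q in zip(c1, c2))
               for c1 in cols for c2 in cols)


# T1 maps three clean classes to three noisy classes (symmetric noise, 0.3).
T1 = [[0.70, 0.15, 0.15],
      [0.15, 0.70, 0.15],
      [0.15, 0.15, 0.70]]
# T2 maps three classes to two observations (illustrative values).
T2 = [[0.9, 0.2, 0.5],
      [0.1, 0.8, 0.5]]

composed = matmul(T2, T1)  # apply T1 first, then T2
assert alpha(composed) <= alpha(T2) * alpha(T1) + 1e-12
```

Composing corruptions can therefore only shrink the coefficient, in line with the data processing view of corruption.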


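10.10 Proof of Lemma 8

Proof. By definition $\|\tilde{\ell}\|_\infty = \sup_{z,a} |\tilde{\ell}(z,a)| = \sup_a \|\tilde{\ell}_a\|_\infty$. Hence, as $\tilde{\ell}_a = R^*\ell_a$,

$$\begin{aligned}
\|\tilde{\ell}\|_\infty &= \sup_a \|\tilde{\ell}_a\|_\infty \\
&\le \sup_a \|R^*\|_\infty \|\ell_a\|_\infty \\
&= \|R^*\|_\infty \|\ell\|_\infty
\end{aligned}$$

where the second line follows from the definition of the operator norm $\|R^*\|_\infty$.


□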


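10.11 Proof of Lemma 9

Proof. Firstly $\|R\|_1 = \|R^*\|_\infty$ [?]. From the definition of $\|R\|_1$ we have

$$\begin{aligned}
\|R\|_1 &= \sup_{u\in\mathbb{R}^{\tilde{Y}}} \frac{\|Ru\|_1}{\|u\|_1} \\
&\ge \sup_{u\in\mathbb{R}^{Y}} \frac{\|RTu\|_1}{\|Tu\|_1} \\
&= \sup_{u\in\mathbb{R}^{Y}} \frac{\|u\|_1}{\|Tu\|_1} \\
&= 1 \Big/ \inf_{u\in\mathbb{R}^{Y}} \frac{\|Tu\|_1}{\|u\|_1},
\end{aligned}$$

this proves the first inequality. Recall one of the equivalent definitions of $\alpha(T)$ from section 4.4

$$\alpha(T) = \sup_{v\in S} \frac{\|Tv\|_1}{\|v\|_1}$$

where $S = \{v : \sum_i v_i = 0,\ v \neq 0\}$. Hence trivially $\inf_{u\in\mathbb{R}^{Y}} \frac{\|Tu\|_1}{\|u\|_1} \le \alpha(T)$.


□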


10.12 Corrupted Learning when Clean Learning is Fast 

There are many conditions under which clean learning is fast. Here we focus on the Bernstein condition
presented in [35].

Definition 4. Let $P \in \mathcal{P}(O)$, $\ell$ a loss and $a_P = \arg\min_a \mathbb{E}_{z\sim P}\,\ell(z,a)$. A pair $(\ell,P)$ satisfies the Bernstein
condition with constant $K$ if for all $a \in A$

$$\mathbb{E}_{z\sim P}\left(\ell(z,a) - \ell(z,a_P)\right)^2 \le K\,\mathbb{E}_{z\sim P}\left[\ell(z,a) - \ell(z,a_P)\right].$$


When $A$ is finite, such a condition leads to $\frac{1}{n}$ rates of convergence. From results in [39] we have the following
theorem.

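Theorem 9 (PAC Bayes Bernstein). Let $\gamma = \frac{e^\beta - 1 - \beta}{\beta\|\ell\|_\infty}$. For all $P$, priors $\pi$, algorithms $\mathcal{A}$, bounded losses $\ell$
and $\beta > 0$

$$\mathbb{E}_{S\sim P^n}\left[\ell(P,\mathcal{A}(S)) - \gamma\,\ell^2(P,\mathcal{A}(S))\right] \le \mathbb{E}_{S\sim P^n}\left[\ell(S,\mathcal{A}(S)) + \|\ell\|_\infty\left(\frac{D_{KL}(\mathcal{A}(S),\pi)}{\beta n}\right)\right].$$

Furthermore with probability at least $1 - \delta$ on a draw $S \sim P^n$ with $\mathcal{A}$, $\beta$ and $\pi$ chosen before the draw

$$\ell(P,\mathcal{A}(S)) - \gamma\,\ell^2(P,\mathcal{A}(S)) \le \ell(S,\mathcal{A}(S)) + \|\ell\|_\infty\left(\frac{D_{KL}(\mathcal{A}(S),\pi) + \log\left(\frac{1}{\delta}\right)}{\beta n}\right).$$

We are now in a position to show that the Bernstein condition leads to fast rates for ERM.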

Theorem 10 (Fast Rates for ERM). Let $\mathcal{A}$ be ERM with $A$ finite. If $(\ell,P)$ satisfies the Bernstein condition
then for some constant $C$

$$\mathbb{E}_{S\sim P^n}\,\ell(P,\mathcal{A}(S)) - \ell(P,a_P) \le \frac{C\log(|A|)}{n}.$$

Furthermore with probability at least $1 - \delta$ on a draw from $P^n$ one has

$$\ell(P,\mathcal{A}(S)) - \ell(P,a_P) \le C\left(\frac{\log(|A|) + \log\left(\frac{1}{\delta}\right)}{n}\right).$$

Proof. First, define $\ell_P(z,a) = \ell(z,a) - \ell(z,a_P)$. $\ell_P$ measures the loss relative to the best action for the
distribution $P$. It is easy to verify that for bounded $\ell$, $\|\ell_P\|_\infty \le 2\|\ell\|_\infty$. We now utilize theorem 9 with $\ell_P$
and $\pi$ uniform on $A$. This yields

$$\mathbb{E}_{S\sim P^n}\left[\ell_P(P,\mathcal{A}(S)) - \gamma\,\ell_P^2(P,\mathcal{A}(S))\right] \le \mathbb{E}_{S\sim P^n}\left[\ell_P(S,\mathcal{A}(S)) + \|\ell_P\|_\infty\left(\frac{\log(|A|)}{\beta n}\right)\right]$$

with $\gamma = \frac{e^\beta - 1 - \beta}{\beta\|\ell_P\|_\infty}$. Firstly ERM minimizes the right hand side of the bound meaning

$$\mathbb{E}_{S\sim P^n}\left[\ell_P(S,\mathcal{A}(S)) + \|\ell_P\|_\infty\left(\frac{\log(|A|)}{\beta n}\right)\right] \le \frac{1}{n}\|\ell_P\|_\infty\left(\frac{\log(|A|)}{\beta}\right).$$

To see this consider the algorithm that always outputs $a_P$; this algorithm generalizes very well, however it may
be suboptimal on the sample. Secondly $(\ell,P)$ satisfies the Bernstein condition with constant $K$. Therefore

$$(1 - \gamma K)\,\mathbb{E}_{S\sim P^n}\,\ell_P(P,\mathcal{A}(S)) \le \frac{1}{n}\|\ell_P\|_\infty\left(\frac{\log(|A|)}{\beta}\right).$$

Finally choose $\beta$ small enough so that $\gamma K < 1$. This can always be done as $\gamma \to 0$ as $\beta \to 0^+$. The high
probability version proceeds in a similar way.

□
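
The last step of the proof only needs $\gamma \to 0$ as $\beta \to 0^+$. The Python sketch below makes this concrete under the assumption that $\gamma = \frac{e^\beta - 1 - \beta}{\beta\|\ell_P\|_\infty}$; the exact normalization of $\gamma$, the halving search and the function names are illustrative assumptions rather than the paper's procedure.

```python
import math


def gamma(beta, loss_bound=1.0):
    """Assumed form of gamma: (exp(beta) - 1 - beta) / (beta * loss_bound).

    math.expm1 keeps the numerator accurate for very small beta.
    """
    return (math.expm1(beta) - beta) / (beta * loss_bound)


def choose_beta(K, loss_bound=1.0, beta=1.0):
    """Halve beta until gamma(beta) * K < 1, so that 1 - gamma * K is positive."""
    while gamma(beta, loss_bound) * K >= 1.0:
        beta /= 2.0
    return beta


K = 50.0  # illustrative Bernstein constant (assumption)
beta = choose_beta(K)
assert gamma(beta) * K < 1.0
```

Since $\gamma(\beta) \approx \frac{\beta}{2\|\ell_P\|_\infty}$ for small $\beta$, the search terminates after roughly $\log_2 K$ halvings.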


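A natural question to ask is when does $(\tilde{\ell},\tilde{P})$ satisfy the Bernstein condition?

Theorem 11. If $(\tilde{\ell},\tilde{P})$ satisfies the Bernstein condition with constant $K$ then $(\ell,P)$ also satisfies the Bernstein
condition with the same constant.

Proof.

$$\begin{aligned}
K\,\mathbb{E}_{z\sim P}\left[\ell(z,a) - \ell(z,a_P)\right] &= K\,\mathbb{E}_{\tilde{z}\sim\tilde{P}}\left[\tilde{\ell}(\tilde{z},a) - \tilde{\ell}(\tilde{z},a_{\tilde{P}})\right] \\
&\ge \mathbb{E}_{\tilde{z}\sim\tilde{P}}\left(\tilde{\ell}(\tilde{z},a) - \tilde{\ell}(\tilde{z},a_{\tilde{P}})\right)^2 \\
&= \mathbb{E}_{z\sim P}\,\mathbb{E}_{\tilde{z}\sim T(z)}\left(\tilde{\ell}(\tilde{z},a) - \tilde{\ell}(\tilde{z},a_P)\right)^2 \\
&\ge \mathbb{E}_{z\sim P}\left(\mathbb{E}_{\tilde{z}\sim T(z)}\,\tilde{\ell}(\tilde{z},a) - \mathbb{E}_{\tilde{z}\sim T(z)}\,\tilde{\ell}(\tilde{z},a_P)\right)^2 \\
&= \mathbb{E}_{z\sim P}\left(\ell(z,a) - \ell(z,a_P)\right)^2
\end{aligned}$$

where the first line follows from the definition of $\tilde{\ell}$ and because $a_P = a_{\tilde{P}}$, the second as $(\tilde{\ell},\tilde{P})$ satisfies the
Bernstein condition and finally we have used the convexity of $f(x) = x^2$.

□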


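This theorem (almost) rules out pathological behaviour where ERM learns quickly from corrupted data and
yet slowly for clean data. At present it is unknown if the converse to theorem 11 is true, with the same or
possibly different constant. Here we present a partial converse.


Definition 5. Let $T : O \rightsquigarrow \tilde{O}$ be a Markov kernel and $\ell$ a loss. A pair $(\ell,T)$ are $\eta$-compatible if for all
$z \in O$ and $a_1, a_2 \in A$

$$\mathbb{E}_{\tilde{z}\sim T(z)}\left(\tilde{\ell}(\tilde{z},a_1) - \tilde{\ell}(\tilde{z},a_2)\right)^2 \le \eta\left(\ell(z,a_1) - \ell(z,a_2)\right)^2.$$

Theorem 12. If the pair $(\ell,P)$ satisfies the Bernstein condition with constant $K$ and the pair $(\ell,T)$ are
$\eta$-compatible then $(\tilde{\ell},\tilde{P})$ satisfies the Bernstein condition with constant $\eta K$.


Proof.

$$\begin{aligned}
\mathbb{E}_{\tilde{z}\sim\tilde{P}}\left(\tilde{\ell}(\tilde{z},a) - \tilde{\ell}(\tilde{z},a_{\tilde{P}})\right)^2 &= \mathbb{E}_{z\sim P}\,\mathbb{E}_{\tilde{z}\sim T(z)}\left(\tilde{\ell}(\tilde{z},a) - \tilde{\ell}(\tilde{z},a_P)\right)^2 \\
&\le \eta\,\mathbb{E}_{z\sim P}\left(\ell(z,a) - \ell(z,a_P)\right)^2 \\
&\le \eta K\,\mathbb{E}_{z\sim P}\left[\ell(z,a) - \ell(z,a_P)\right] \\
&= \eta K\,\mathbb{E}_{\tilde{z}\sim\tilde{P}}\left[\tilde{\ell}(\tilde{z},a) - \tilde{\ell}(\tilde{z},a_{\tilde{P}})\right]
\end{aligned}$$

where we have first used $\eta$-compatibility, then the fact that $(\ell,P)$ satisfies the Bernstein condition with
constant $K$ and finally the definition of $\tilde{\ell}$.

□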


While by no means the final line in fast corrupted learning, this theorem does allow one to prove interesting 
results in the binary classification setting. 
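Theorem 13. Let $T$ be label noise, $T = \begin{pmatrix} 1-\sigma_{-1} & \sigma_1 \\ \sigma_{-1} & 1-\sigma_1 \end{pmatrix}$, then the pair $(\ell_{01},T)$ is $\eta$-compatible with

$$\eta = \left(\frac{\max(1-\sigma_{-1}+\sigma_1,\ 1-\sigma_1+\sigma_{-1})}{1-\sigma_{-1}-\sigma_1}\right)^2.$$

Proof. Due to the symmetry of the left and right hand sides of the $\eta$-compatibility condition, one only needs to
check the case where $a_1 = 1$, $a_2 = -1$. Recall

$$\begin{aligned}
\tilde{\ell}_{01}(\tilde{y},a) &= \frac{(1-\sigma_{-\tilde{y}})\,\ell_{01}(\tilde{y},a) - \sigma_{\tilde{y}}\,\ell_{01}(-\tilde{y},a)}{1-\sigma_{-1}-\sigma_1} \\
&= \frac{(1-\sigma_{-\tilde{y}}+\sigma_{\tilde{y}})\,\ell_{01}(\tilde{y},a) - \sigma_{\tilde{y}}}{1-\sigma_{-1}-\sigma_1}.
\end{aligned}$$

For $y = \pm 1$ it is easy to confirm $(\ell_{01}(y,1) - \ell_{01}(y,-1))^2 = 1$. We have

$$\begin{aligned}
\tilde{\ell}_{01}(\tilde{y},1) - \tilde{\ell}_{01}(\tilde{y},-1) &= \frac{(1-\sigma_{-\tilde{y}}+\sigma_{\tilde{y}})\left(\ell_{01}(\tilde{y},1) - \ell_{01}(\tilde{y},-1)\right)}{1-\sigma_{-1}-\sigma_1} \\
&= \frac{-\tilde{y}\,(1-\sigma_{-\tilde{y}}+\sigma_{\tilde{y}})}{1-\sigma_{-1}-\sigma_1}.
\end{aligned}$$

Squaring, taking maximums and finally expectations yields the desired result.

□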
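
The compatibility condition for 01 loss under binary label noise can be verified directly. The Python sketch below assumes $\eta$ is the squared maximum that the proof produces, namely $\left(\frac{\max(1-\sigma_{-1}+\sigma_1,\ 1-\sigma_1+\sigma_{-1})}{1-\sigma_{-1}-\sigma_1}\right)^2$, and checks the inequality of definition 5 for both clean labels; the function names and noise rates are illustrative.

```python
def corrupted_01(y_noisy, a, sigma):
    """Corrupted 01 loss for labels in {-1, +1}; sigma maps class to flip rate."""
    def l01(y, act):
        return 0.0 if y == act else 1.0
    denom = 1.0 - sigma[-1] - sigma[1]
    return ((1.0 - sigma[-y_noisy]) * l01(y_noisy, a)
            - sigma[y_noisy] * l01(-y_noisy, a)) / denom


def eta(sigma):
    """Assumed compatibility constant: squared maximum from the proof."""
    denom = 1.0 - sigma[-1] - sigma[1]
    top = max(1.0 - sigma[-1] + sigma[1], 1.0 - sigma[1] + sigma[-1])
    return (top / denom) ** 2


def compatibility_lhs(y, sigma):
    """Expected squared difference of corrupted losses between actions +1 and -1,
    where the noisy label is drawn given the clean label y."""
    def sq(y_noisy):
        return (corrupted_01(y_noisy, 1, sigma) - corrupted_01(y_noisy, -1, sigma)) ** 2
    return (1.0 - sigma[y]) * sq(y) + sigma[y] * sq(-y)


sigma = {-1: 0.1, 1: 0.3}  # illustrative noise rates (assumption)
for y in (-1, 1):
    # The clean squared difference is 1, so the condition reads lhs <= eta.
    assert compatibility_lhs(y, sigma) <= eta(sigma) + 1e-12
```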

One very useful example of a pair $(\ell,P)$ satisfying the Bernstein condition with constant 1 is when $P$ is
separable, $\ell$ is 01 loss and the Bayes optimal classifier is in the function class. Theorem 13 guarantees that in
such a setting one can learn at a fast rate from noisy examples.
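
The claim that a separable $P$ with 01 loss satisfies the Bernstein condition with constant 1 can be seen on a toy problem. In the Python sketch below the distribution, the classifiers and the function names are all illustrative assumptions; the point is that when the best classifier has zero loss, the squared excess loss equals the excess loss itself.

```python
def bernstein_ratio(points, loss, a, a_best):
    """Ratio E(l(z,a) - l(z,a_best))^2 / E[l(z,a) - l(z,a_best)] for one action a.

    points is a list of (probability, z) pairs describing P.
    Returns 0.0 when a has no excess loss.
    """
    second = sum(p * (loss(z, a) - loss(z, a_best)) ** 2 for p, z in points)
    excess = sum(p * (loss(z, a) - loss(z, a_best)) for p, z in points)
    return 0.0 if excess == 0.0 else second / excess


def zero_one(z, f):
    """01 loss of classifier f on the example z = (x, y)."""
    x, y = z
    return 0.0 if f(x) == y else 1.0


def bayes(x):
    """Bayes optimal classifier for the toy problem: the sign of x."""
    return 1 if x > 0 else -1


def shifted(x):
    """A suboptimal threshold classifier."""
    return 1 if x > 1.5 else -1


# Separable toy distribution: x uniform on four points, y = sign(x).
points = [(0.25, (x, 1 if x > 0 else -1)) for x in (-2, -1, 1, 2)]
assert bernstein_ratio(points, zero_one, shifted, bayes) == 1.0
```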